Chemometric Strategies for Fully Automated Interpretive Method Development in Liquid Chromatography

The majority of liquid chromatography (LC) methods are still developed in a conventional manner, that is, by analysts who rely on their knowledge and experience to make method development decisions. In this work, a novel, open-source algorithm was developed for automated and interpretive method development of LC(−mass spectrometry) separations (“AutoLC”). A closed-loop workflow was constructed that interacted directly with the LC system and ran unsupervised in an automated fashion. To achieve this, several challenges related to peak tracking, retention modeling, the automated design of candidate gradient profiles, and the simulation of chromatograms were investigated. The algorithm was tested using two newly designed method development strategies. The first utilized retention modeling, whereas the second used a Bayesian-optimization machine learning approach. In both cases, the algorithm could arrive within 4–10 iterations (i.e., sets of method parameters) at an optimum of the objective function, which included resolution and analysis time as measures of performance. Retention modeling was found to be more efficient while depending on peak tracking, whereas Bayesian optimization was more flexible but limited in scalability. We have deliberately designed the algorithm to be modular to facilitate compatibility with previous and future work (e.g., previously published data handling algorithms).


S-1 Samples
Sample A was an antibody digest dissolved in buffer, reported earlier to contain at least 189 compounds 1.
For Bayesian optimization, Sample B, a mixture of 80 dye reference compounds obtained from the Dutch Cultural Heritage Agency, was used and prepared as reported earlier 2. Sample B contained at least 158 components. To prepare the sample mixture for retention modelling, all dyes for Sample B were dissolved at a concentration of 50 ppm in eluent B.

S-2 Peak tracking results for LC-MS measurements of antibody digest sample
These are the peak tracking results for the optimization of the LC-MS separation of Sample A (antibody digest) on System B. The first item is the peak tracking table. The next items are the resulting chromatograms for each iteration, annotated with the peak tracking results. Note that, because peaks were retracked during iterations 4 and 9, the peak numbering differs between the three groups of chromatograms (iterations 1-3, 4-8 and 9-12).
The algorithm would progress with further iterations using the interior-point algorithm until the successive change in the sum of squared errors (SSE) was below 10⁻⁶. For each analyte, the best fit from the 20 loops was transferred to the next phase for use in optimization.
φinit was allowed to vary between 0.05 and 0.30, and φ2, φ3, φ4 and φ5 were all allowed to vary between 0.05 and 0.95, whereas φ6 was varied between 0.80 and 0.95. The time parameters were allowed to vary between 0.01 and 25 min for one set and between 0.5 and 25 min for the other. Afterwards, the method concluded and the composition was set to φ1 (0.02). There was no controlled re-equilibration, but due to the processing time of the algorithm, the column would receive at least 4 column volumes of eluent A prior to the start of the next iteration.
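The modifier-fraction bounds quoted above can be written down as a small lookup table that a candidate gradient program is sampled from or clipped to. The following is a minimal sketch only: the key names (`phi_init`, `phi_2`, ...) and the helper functions are hypothetical labels for this illustration, uniform sampling is an assumption, and the time bounds are left out because their symbols are not recoverable from the text.

```python
import random

# Bounds on the modifier fraction at each gradient node, as quoted in the text.
# The key names are hypothetical labels chosen for this sketch.
PHI_BOUNDS = {
    "phi_init": (0.05, 0.30),
    "phi_2": (0.05, 0.95),
    "phi_3": (0.05, 0.95),
    "phi_4": (0.05, 0.95),
    "phi_5": (0.05, 0.95),
    "phi_6": (0.80, 0.95),
}


def sample_candidate(rng):
    """Draw one set of modifier fractions uniformly within the bounds (assumption)."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PHI_BOUNDS.items()}


def clip_to_bounds(candidate):
    """Force every fraction of a proposed candidate back inside its bounds."""
    return {
        name: min(max(candidate[name], lo), hi)
        for name, (lo, hi) in PHI_BOUNDS.items()
    }


def within_bounds(candidate):
    """Check that every fraction of a candidate respects its bounds."""
    return all(lo <= candidate[name] <= hi for name, (lo, hi) in PHI_BOUNDS.items())


rng = random.Random(0)  # fixed seed so the sketch is reproducible
candidate = sample_candidate(rng)
```

Keeping the bounds in one table means they can be relaxed or removed in a single place, which matches the remark that they are a convenience rather than a requirement.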

S-4 Overview of gradient program
A minimum modifier fraction of 0.05 or 0.02 is common in RPLC and follows the recommendations of the column manufacturer, so that the stationary phase stays in good condition. φinit was capped at 0.3 to reduce the search space of the start conditions, which allows the algorithm to converge to an optimal gradient program faster (i.e., it saves minutes of computational time). These bounds can be omitted without any problem, but RPLC separations often start with a relatively low modifier fraction. Similarly, the high ranges were pre-selected to ensure that all analytes elute from the column. Again, this was done to allow the algorithm to be universally applied.

S-6 Prediction errors in retention time and peak width
The graphs in this section display reconstructions of predicted chromatograms. Note that in these graphs the peak intensity was normalized. The graphs nevertheless provide useful insight into retention times and peak widths. The blue stars depict the magnitude of the prediction error in retention time, whereas the red stars depict this for the peak width (sigma). The second y-axis provides the numerical scale for both prediction errors. The error in peak width always appears to be below 1%, with a systematic negative bias. The error in retention time is, however, rather significant in MDI 4 and MDI 5 for the less retained analytes. This is most likely because these predictions are based on peak data that was heavily convoluted in the scouting MDI (i.e., all analytes co-eluting early).
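As a minimal sketch of how such prediction errors can be computed, the snippet below uses a signed relative error in percent. The sign convention (predicted minus measured, so that a negative value means underprediction) and the function names are assumptions made for this illustration, not taken from the published code.

```python
def relative_error_percent(predicted, measured):
    """Signed relative prediction error in percent (negative = underprediction)."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return 100.0 * (predicted - measured) / measured


def summarize_errors(predicted, measured):
    """Return the per-peak errors and their mean; a non-zero mean indicates a bias."""
    errors = [relative_error_percent(p, m) for p, m in zip(predicted, measured)]
    mean_error = sum(errors) / len(errors)
    return errors, mean_error
```

A mean error that stays clearly below zero across peaks would correspond to the systematic negative bias noted for the peak width.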

S-7 Minimum peak width
Convoluted peaks are detected with too small a peak width if no deconvolution is performed. To ensure a good separation without adding deconvolution, which increases the computational time significantly, we implemented a minimum peak width at the base of 0.3 min. Unfortunately, this impaired the algorithm's ability to predict the true width, which was often much smaller. Future iterations of this work will address this.
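The effect of the 0.3 min floor can be illustrated with the standard base-width resolution formula, Rs = 2(t2 − t1)/(w1 + w2). This is a sketch under assumptions: the function names are hypothetical, and whether the published objective function uses exactly this resolution definition is not stated here.

```python
MIN_BASE_WIDTH_MIN = 0.3  # minimum peak width at the base (min), from the text


def floored_width(detected_width, minimum=MIN_BASE_WIDTH_MIN):
    """Replace any detected base width below the floor by the floor itself."""
    return max(detected_width, minimum)


def resolution(t1, w1, t2, w2, apply_floor=True):
    """Base-width resolution Rs = 2 * |t2 - t1| / (w1 + w2)."""
    if apply_floor:
        w1, w2 = floored_width(w1), floored_width(w2)
    return 2.0 * abs(t2 - t1) / (w1 + w2)
```

For two peaks 0.3 min apart with true base widths of 0.1 min, the floored calculation gives a resolution of about 1.0 instead of about 3.0. The floor is therefore conservative for narrow peaks, which is the impairment described above.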

S-8 In-depth study of retention modeling using UV-vis data and the quadratic model
We also investigated whether the number of retention data (i.e., the number of previous MDI) significantly affected the model. We employed the quadratic model (Equation
Usually, longer (or shallower) gradients yield better separation, which is also reflected by the first five MDI. Optimization algorithms will therefore often prefer such gradients when the number of gradient segments is limited. However, after MDI 6, once the algorithm was forced to use multi-segment gradients (Figure S-20F), the algorithm exclusively proposed relatively short gradients, as also indicated by the small bubbles. Figure 4D also showcases an example of how an exit function (a function that decides whether the automated workflow should stop) can be designed. The log fit through the bubbles (dashed, dark blue line) flattens towards the higher number of MDI. This indicates that every additional MDI yields less improvement. Using the first derivative of this function (i.e., the slope of the log function), a threshold can be defined. Once the slope is below this threshold, the automation algorithm can exit the iteration loop.
Finally, Figure S-20E shows that, using the LSS model for the same sample, the optimization curve flattens more rapidly. Due to the lack of degrees of freedom, the algorithm is incapable of further fine-tuning the separation conditions to achieve a better performance.
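The exit function described above can be sketched as follows: fit score = a + b·ln(n) to the objective values obtained so far, evaluate the slope b/n at the latest MDI, and stop once it falls below a threshold. This is a minimal illustration; the function names, the minimum of three points, and the threshold value are assumptions rather than the published implementation.

```python
import math


def fit_log(mdi_numbers, scores):
    """Least-squares fit of score = a + b * ln(n); returns (a, b)."""
    x = [math.log(n) for n in mdi_numbers]
    n_pts = len(x)
    mean_x = sum(x) / n_pts
    mean_y = sum(scores) / n_pts
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def should_exit(mdi_numbers, scores, threshold):
    """True once the slope of the log fit at the latest MDI is below threshold."""
    if len(mdi_numbers) < 3:  # too few points for a meaningful fit (assumption)
        return False
    _, b = fit_log(mdi_numbers, scores)
    slope = b / mdi_numbers[-1]  # d/dn of a + b*ln(n), evaluated at the last MDI
    return slope < threshold
```

Because the slope of a log function decays as 1/n, the threshold directly sets how many iterations are tolerated for a given rate of improvement.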

S-9 UV-Vis peak tracking data
The tables below give an overview of the obtained detection and tracking results using the LSS (Table S-2) and quadratic (Table S-3) models.

S-12 Algorithm
The *.zip package contains all code used to construct the master algorithm in the various programming languages. The master code is run within Python, and configuration for the LC system requires the Automation package freely provided by Agilent. We would like to emphasize that this is a work-in-progress prototype released for transparency and scientific proliferation, and that future (user-friendly) versions will be released on https://www.castamsterdam.org/. To use this prototype, significant tailoring of the individual scripts will be required.
The UV-vis toolbox by Denice van Herwerden, used in this study for the data in Sections S-8 and S-9, can be downloaded elsewhere 5.

Further design considerations
Analytical instruments comprise sophisticated hardware with safety mechanisms in place to avoid improper use. This is pivotal for industrial environments. For this reason, we opted to interface with existing instrument control software (ICS). However, ICS is often the product of several stages of development and continuously adapted to new technology. To prevent the algorithm from requiring adaptation to specific ICS and its procedures, the algorithm was designed to program and activate the ICS after which a listener function was activated, while the algorithm would remain dormant. Once the LC experiment was finished, the ICS was programmed to create a signal for the listener function to reactivate the algorithm.
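The listener pattern described above can be sketched as a function that sleeps until the ICS leaves a signal behind. In this sketch the signal is assumed to be a file written by the ICS at the end of the run; the file-based mechanism, the function name, and the parameters are all hypothetical and not taken from the published code.

```python
import time
from pathlib import Path


def wait_for_signal(signal_path, poll_interval_s=1.0, timeout_s=None):
    """Block until the signal file exists; return True, or False on timeout."""
    signal_path = Path(signal_path)
    start = time.monotonic()
    while not signal_path.exists():
        if timeout_s is not None and time.monotonic() - start > timeout_s:
            return False  # the ICS never signalled within the allowed time
        time.sleep(poll_interval_s)  # the algorithm stays dormant between polls
    signal_path.unlink()  # consume the signal so the next run starts clean
    return True
```

With this design the algorithm needs no knowledge of the internals of the ICS: it only has to program the run and agree on where the end-of-run signal will appear.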
Besides being independent and interpretive, the algorithm should be flexible towards metrics published in the literature. To this end, the algorithm should be modular. This was achieved by defining a chain of independent operations with controlled input and output criteria. For example, any background-correction algorithm in the literature can be used if its input is a raw signal and its output a processed signal. This is less trivial than it appears, as some background-correction algorithms require data-specific parameters to be defined. To be interpretive, each metric must include additional subroutines that interpret the signal characteristics and self-determine the parameters, which renders various strategies more challenging.
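A chain of independent operations with a fixed input and output contract can be sketched as a list of functions that each take a signal and return a signal. The two steps below are deliberately simple placeholders (a minimum-subtraction "background correction" and a normalization) that determine their own parameters from the data; all names are hypothetical and the steps are not the algorithms used in this work.

```python
from typing import Callable, List, Sequence

Signal = List[float]
Step = Callable[[Signal], Signal]


def subtract_minimum(signal: Signal) -> Signal:
    """Toy background correction: the baseline is self-determined as the minimum."""
    baseline = min(signal)
    return [value - baseline for value in signal]


def normalize(signal: Signal) -> Signal:
    """Scale the signal so that its maximum equals 1 (unchanged if the maximum is 0)."""
    peak = max(signal)
    return [value / peak for value in signal] if peak else list(signal)


def run_chain(signal: Signal, steps: Sequence[Step]) -> Signal:
    """Apply each step in order, checking the output contract after every step."""
    for step in steps:
        signal = step(signal)
        if not isinstance(signal, list):
            raise TypeError(f"{step.__name__} must return a list of floats")
    return signal
```

Swapping in a different background-correction routine then only requires that it honours the same signal-in, signal-out contract, for example `run_chain(raw, [my_correction, normalize])`.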